To check whether Git is installed, open a terminal and type:
git version
If Git is installed, you will get a version number (mine is 2.23.0). If you instead get git: command not found, then install Git (on Debian/Ubuntu):
sudo apt-get install git
Git uses a global config file that is hidden in your home directory to track many configuration settings (on a Mac the home directory is typically ~).
To tell Git your name and email address, open up a terminal window in R Studio and type:
git config --global user.name "YOUR FULL NAME"
git config --global user.email "YOUR EMAIL ADDRESS"
If you are worried about email privacy, follow GitHub’s instructions here.
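A minimal sketch of what these two commands do, assuming a POSIX shell. So that it never touches your real settings, it points Git at a scratch home directory. The helper sandbox_git, the name, and the email address are placeholders invented for this illustration.

```shell
# Scratch "home" directory so the real ~/.gitconfig is left alone (assumption for this demo).
scratch_home="$(mktemp -d)"

# Hypothetical helper: run git with HOME (and XDG_CONFIG_HOME) pointed at the scratch directory.
sandbox_git() {
  HOME="$scratch_home" XDG_CONFIG_HOME="$scratch_home/.config" git "$@"
}

# The same commands as above, with placeholder values.
sandbox_git config --global user.name "Ada Example"
sandbox_git config --global user.email "ada@example.com"

# Read one setting back, then look at the hidden file Git wrote.
sandbox_git config --global user.name
cat "$scratch_home/.gitconfig"
```

On your own machine you would run plain git config --global, which writes the same kind of file in your actual home directory.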
Extra Setup Step for Windows Users
To use the Git Bash terminal in R Studio in Windows, you need to make sure it is set as the default.
<https://swcarpentry.github.io/git-novice/02-setup/> has the commands for different editors.
If you want to use Atom,
Go to <https://atom.io/> and install the version for your computer type.
Once you have installed the Atom editor, install an Atom package for the terminal
In the Welcome Guide click on install package and then open installer
Enter platformio-ide-terminal and click install
You can then open up a terminal window with the + sign at the bottom left
To make Atom the default text editor for Git,
git config --global core.editor "atom --wait"
To review your settings, type git config --list and hit enter to scroll down till you get to the end.
Type q to quit reading and get your cursor back.
Version control system: a program which tracks iterative changes of your local files. VCS programs have been around for years to support software development projects.
Git is the most popular VCS and “one of the best version control tools available” in 2020
You can go back to previous versions of your code/text, then move forward to the most recent version, or keep the old version.
You can create copies of the code, change them, then merge these copies together later.
You want to try out something new, but you aren’t sure if it will work.
Non-git solution: Copy and rename the files over and over
Issues:
Git lets you change files while automatically keeping track of old versions. It is easy to revert back to old versions if you decide the new changes don’t work.
In a group setting, your collaborators might (will) suggest how to change your analysis/code.
First non-git solution: Email files back/forth.
Second non-git solution: Share a Dropbox or Google Docs folder (a “centralized” version control system).
Git lets each individual work on their own local repository and offer their changes for review in a way that allows you to control which changes get approved to be incorporated into the baseline and then automatically incorporate those changes. Documentation of changes is built into the workflow (so it actually happens!).
In a Kaggle Survey, 58.5% of data scientists say they primarily use Git for sharing code.
Per KDNuggets, Git is #2 of the Top 5 Data Science Skills for 2020
You can make your final-project repo public so prospective employers can view your work.
You can host a website on GitHub, increasing your visibility. Professor Gerard hosts his personal website and teaching websites on GitHub.
Git works by managing file status using three levels [1]:
Working Directory: The folder where your shell thinks it is. To Git, this means the current versions of the files. Changes to files that you haven’t committed to Git only exist in the working directory and are not yet saved in the history.
Stage: Files that are staged (added) are prepared (scheduled) to be committed to the history, but not yet committed. Only files in the stage will be committed to the history.
History: The time line of file versions (snapshots). You commit a file to the history and then, even if you modify it later, you can always go back to that same file version.
We’ll focus on the right-hand-side of this diagram where the workflow is typically:
The workflow on the left-hand side of the diagram is used to reverse actions from the right when you need to undo mistakes or have changed your mind.
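The three levels can be watched directly with git status --short. The sketch below assumes a POSIX shell and works in a throwaway temporary folder; the file name notes.txt and the "Demo User" identity are placeholders, not part of the course material.

```shell
# Throwaway repository in a temporary folder.
demo="$(mktemp -d)"
cd "$demo"
git init --quiet

# 1. Working directory only: the file exists but Git is not tracking it.
echo "first draft" > notes.txt
in_working_dir="$(git status --short)"
echo "$in_working_dir"        # ?? notes.txt

# 2. Stage: the file is scheduled for the next commit.
git add notes.txt
in_stage="$(git status --short)"
echo "$in_stage"              # A  notes.txt

# 3. History: the snapshot is committed, so status has nothing left to report.
#    The -c options supply a placeholder identity for this one command.
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit --quiet -m "Add notes"
in_history="$(git status --short)"
echo "status after commit: '$in_history'"
```

In the short format, ?? marks an untracked file and A marks a newly staged one. An empty result means the working directory matches the latest commit.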
All Git commands begin with git followed immediately by an argument for the type of command you want to execute.
For the right-hand-side of the diagram, the following are the useful Git commands:
git init: Initialize a Git repository. Only do this once per project/repository.
git status: Show which files are staged in your working directory, and which are modified but not staged.
git add: Add modified files from your working directory to the stage.
git diff: Look at how files in the working directory have been modified.
git diff --staged: Look at how files in the stage have been modified.
git commit -m "[descriptive message]": Commit your staged content as a new commit snapshot.
A repository (or repo, for short) is a collection of files (in a folder and its subfolders) being version controlled (configuration managed) as a set.
The repo also contains the local version control data, usually in a hidden folder and files.
In data science, each repository is typically one project (like an analysis, a model, a homework, or a collection of code that performs a similar task).
Recommend creating a folder somewhere on your computer called STAT_413 or STAT_613 with three sub-folders for Lectures, Homework, and Project - note no spaces in the names.
Suggest treating Lectures and Project as repositories and create lower level folders for each class period if you wish.
Suggest creating a subfolder under Homework for each week’s assignment and treating each subfolder as its own repository (which will usually be how the assignments arrive, with their own set of folders).
These repositories will constitute your local repositories within which you will navigate to various working directories to manage your files and sync updates with Git and between Git and GitHub.
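One way to set up the suggested layout from a terminal is sketched below, assuming a POSIX shell. It builds everything inside a temporary folder so nothing on your machine is disturbed; the homework folder name hw01 is a hypothetical example.

```shell
# Build the suggested layout inside a temporary folder (swap in a real location when ready).
course_root="$(mktemp -d)/STAT_413"
mkdir -p "$course_root/Lectures" "$course_root/Homework" "$course_root/Project"

# Lectures and Project are each one repository.
git init --quiet "$course_root/Lectures"
git init --quiet "$course_root/Project"

# Each homework gets its own subfolder and its own repository; "hw01" is a hypothetical name.
mkdir -p "$course_root/Homework/hw01"
git init --quiet "$course_root/Homework/hw01"

# The hidden .git folder is what marks a folder as a repository.
ls -a "$course_root/Lectures"
```

Note that Homework itself is left as an ordinary folder here; only its subfolders are repositories.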
We’ll learn some Git as we examine a topic from the famous paper of Oeppen and Vaupel (2002).
Oeppen and Vaupel (2002) found perhaps the strongest association in social science: a linear relationship between year of birth and the maximum life expectancy where the maximum is taken over countries. We’ll examine this relationship for ourselves.
We’ll use the gapminder_unfiltered data frame from the gapminder library. The variables in this data frame are:
country: The name of the country.
continent: The continent of the country.
year: The year of the measurement. From 1952 to 2007.
lifeExp: The life expectancy at birth, in years.
pop: Population.
gdpPercap: GDP per capita (US$, inflation-adjusted).
Create a folder called life_exp, e.g., under STAT_X13/Lectures/Week01_git.
Create an R Markdown file called “life_exp_analysis.Rmd” within the life_exp folder. Your Rmd might look something like this:
Save “life_exp_analysis.Rmd”.
Open up a terminal and navigate so the working directory is STAT_X13/Lectures.
Your R Studio should look something like:
Use the command git init to create a repository.
Now type in the terminal
git init
You’ve just created a Git repository! That means there is a .git hidden folder tracking all of the changes you make for the files you tell it about. To see it, list all files, including hidden ones:
ls -a
However, Git won’t track any files until you tell it to.
Use git status to see what files Git is tracking and which are untracked.
git status
The output should tell you that life_exp_analysis.Rmd is not tracked. In fact you should have no tracked files.
Use git add to add files to the stage.
git add life_exp_analysis.Rmd
Always check which files have been added:
git status
Useful flags for git add:
--all will stage all modified and untracked files.
--update will stage all modified files, but only if they are already being tracked.
Use git commit to commit files that have been staged to create snapshots in the commit history.
The -m argument will allow/require you to make a comment about the commit.
git commit -m "New life exp rmd file."
Your message (written after the -m argument) should be concise, and describe what has been changed since the last commit.
If you forget to add a message, Git will open up your default text-editor where you can write down a message, save the file, and exit. The commit will occur after you exit the text editor.
If your default text editor is vim, press Escape, then type :wq and hit enter to save the message and exit (typing :q! instead exits without saving, which aborts the commit). See this for more options.
git status should now be clear because there are no modified files:
git status
You can see all of your commits using git log.
git log
You have now completed the workflow on the right side of the Git diagram.
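The difference between the --update and --all flags of git add is easiest to see side by side. The sketch below assumes a POSIX shell and a throwaway temporary repository; the file names and the "Demo User" identity are placeholders.

```shell
# Throwaway repository with one committed (tracked) file.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
echo "version one" > tracked.txt
git add tracked.txt
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit --quiet -m "Track one file"

# Modify the tracked file and create a brand new, untracked file.
echo "version two, edited" > tracked.txt
echo "something new" > untracked.txt

# --update stages only files Git already tracks.
git add --update
after_update="$(git status --short)"
echo "$after_update"

# --all also picks up the untracked file.
git add --all
after_all="$(git status --short)"
echo "$after_all"

# Commit both changes and list the history, one line per commit.
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit --quiet -m "Edit tracked file and add a new one"
git log --oneline
```

After git add --update, untracked.txt is still marked ?? (untracked). After git add --all, it is marked A (staged as a new file).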
Now modify life_exp_analysis.Rmd. The changes involve:
the gapminder_unfiltered data frame,
loading the gapminder_unfiltered data frame into R, and
group_by() and filter().
Use git diff to see changes in all modified files.
git diff
Lines after a “+” are being added. Lines after a “-” are being removed.
When there are a lot of lines that fill your terminal window, you can exit git diff by hitting q.
Check the status of your modified files.
Stage your modified files, but don’t commit yet.
Recheck your status.
git diff won’t check for changes in the staged files by default. But you can see the differences using git diff --staged.
git diff
git diff --staged
Commit your changes. Use a nice commit message.
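The behavior of git diff before and after staging can be checked with a small experiment. This is a sketch assuming a POSIX shell and a throwaway temporary repository; report.txt and the "Demo User" identity are placeholders.

```shell
# Throwaway repository with one committed file.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
printf 'line one\n' > report.txt
git add report.txt
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit --quiet -m "Add report"

# Modify the file: plain git diff reports the new line with a leading "+".
printf 'line one\nline two\n' > report.txt
unstaged_before="$(git diff)"

# Stage the change: plain git diff goes quiet, git diff --staged now shows it.
git add report.txt
unstaged_after="$(git diff)"
staged_after="$(git diff --staged)"

printf '%s\n' "$staged_after"
```

Plain git diff compares the working directory with the stage, while git diff --staged compares the stage with the last commit, which is why the change moves from one to the other when you run git add.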
Create a repo on GitHub by selecting “New” on the homepage:
Or go to the “Repositories” tab and select “New”
Tell GitHub the name of your repo. In general, it can be a different name than the repo on your local machine.
Make a small description.
To avoid errors, do not initialize the new repository with README, license, or gitignore files. You can add these files after your project has been pushed to GitHub.
Then click “Create Repository”.
The URL is the location of the repo. It is generally of the form “https://github.com/GHUser/GitHubRepoName.git” where “GitHubRepoName” is whatever you chose to name the repo on GitHub.
The command git remote tells Git to do something associated with a remote repository. In this case we want to add a new one. We need to tell Git the name and location of the added remote repository.
GitHub allows you to copy the URL for your new repository - you can use the Clone or Download button to copy it to your clipboard for pasting into your terminal. It also gives you suggestions for the commands to use.
Use git remote add to tell Git where we will host our repo.
git remote add origin https://github.com/rressler/life_exp_rressler.git
In the above command, “origin” is just a nickname we gave to the location URL that is hosting our repo. We could have used “github” or “deep_space_nine” instead, but “origin” or “upstream” are traditional names.
Use git push to push commits to GitHub.
If you are pushing to a brand new repo on GitHub, you need to use:
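git push -u origin master
The terminal should ask you for your GitHub username and password. (GitHub no longer accepts account passwords for Git over HTTPS; enter a personal access token when prompted for the password.)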
When the push is complete, your code is now up on GitHub.
Here we are pushing to the remote repository (origin) from our current local branch (master).
origin is for “Other computers” and master is for “My computer”.
The -u is needed since this is a new repository. It tells Git to connect the behind-the-scenes tracking information between origin (GitHub) and master (git).
The -u is short for the --set-upstream option, where the “upstream” location is the GitHub origin and it is “upstream” from the local master.
For all subsequent pushes for the same repos, once the origin and master repos are connected, you can just type:
git push
A “pull request” is a request in GitHub from you to your partner to “pull” (create a virtual copy of) your code and review it in case they want to incorporate the changes you made into their baseline file in their repository.
Navigate to your forked version of your partner’s repo up on GitHub. There are two ways to generate a new request.
Click on the Pull Requests tab and click on “New pull request”
Write an informative title and message on what your code does, (how it fixes an issue or adds new functionality - why they should use it) then click “Create pull request”.
A pull request is a request to review someone’s code for possible inclusion into the baseline.
If you accept the request once you have reviewed the code, you can initiate a Merge which will update the baseline to include the changes in the submitted code.
Use the “Pull requests” tab on your dashboard or in your repository to view all the pull requests folks have submitted to you.
Navigate to the pull request your partner sent you. Then you can see the changes they made under the “Files changed” tab.
You can write comments — for example asking them to change the code before you accept the merge request.
Or, you can just accept the pull request by hitting “Merge pull request” and then “Confirm Merge”.
This was a case where there should not have been any merge conflicts. If, however, two people are working on the same line of code, that will create a merge conflict that GitHub does not know how to resolve. You will have to decide collectively what should go into the baseline and each update the files to eliminate the conflict.
Once the merge is complete, the screen will update to show the pull request is closed. It will also offer the opportunity to delete the forked repo.
If you are done with it, that is a good practice from a housekeeping perspective. However, if you have multiple issues you are working on, do not delete it until after all issues are closed.
Now that you have updated the baseline, there may be other changes on the GitHub baseline you want to incorporate back into your copy of the files.
Use git pull whenever there are modifications on GitHub and you want to bring your local repo up-to-date.
Go back to your original directory life_exp. Check the status. It has no idea your baseline file on GitHub has been updated with the graph code. You want to keep your GitHub remote repo and your local repo in sync.
To update your local file with the changes from the remote (GitHub), make sure you are back in the correct working directory. The git pull command does two things:
It runs the fetch command to copy the updated code to your machine, and then
it runs the merge command to update your local files.
If you changed your local files while you also changed your remote files, you could have created a merge conflict you will need to resolve. Git will tell you about it.
git pull
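The push and pull cycle can be rehearsed entirely offline. The sketch below assumes a POSIX shell and uses a local bare repository as a stand-in for GitHub, so it needs no account or network. The folder names, the helper demo_commit, and the "Demo User" identity are all invented for this illustration.

```shell
# Three throwaway folders: a bare repo standing in for GitHub, your repo, and a partner's clone.
sandbox="$(mktemp -d)"
hub="$sandbox/hub.git"
mine="$sandbox/mine"
partner="$sandbox/partner"
git init --quiet --bare "$hub"
git init --quiet "$mine"

# Hypothetical helper: commit in folder $1 with message $2 using a placeholder identity.
demo_commit() {
  git -C "$1" -c user.name="Demo User" -c user.email="demo@example.com" \
      commit --quiet -m "$2"
}

# First commit in your repo; name the branch master to match the text.
echo "max life expectancy by year" > "$mine/analysis.txt"
git -C "$mine" add analysis.txt
demo_commit "$mine" "Start analysis"
git -C "$mine" branch -M master

# Connect the remote and push, setting the upstream with -u.
git -C "$mine" remote add origin "$hub"
git -C "$mine" push --quiet -u origin master

# A partner clones the shared repo, adds a line, and pushes it back.
git clone --quiet --branch master "$hub" "$partner"
echo "add a plot" >> "$partner/analysis.txt"
git -C "$partner" add analysis.txt
demo_commit "$partner" "Add plot step"
git -C "$partner" push --quiet

# Your repo is now behind; git pull fetches the new commit and merges it.
git -C "$mine" pull --quiet
cat "$mine/analysis.txt"
```

Because -u recorded origin/master as the upstream of your local master, the final git pull needs no extra arguments. With a real GitHub repository the only change is that the remote location is an https URL instead of a local folder.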
Commit frequently. Use git add --update to reduce typing when you are just updating the same set of files. On the homework, commit after you complete each question.
Keep your local and remote (GitHub) files in sync to minimize merge conflicts.
When collaborating, communicate who is working on what. Use Issues and Pull Requests for each issue and solve one issue at a time as separate workflows.
Usually, you should only commit plain text files such as .Rmd or HTML, not R-generated PDFs or .docx files, which are easy to reproduce and clog up your repo storage.
Do Not Push or Upload sensitive data or information. This could include personal identifying information, passwords, SSH private keys, etc.
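One common way to follow the last two practices is a .gitignore file, which tells Git to skip matching files even when you stage everything. This is a sketch assuming a POSIX shell and a throwaway temporary repository; the file names, including secrets.txt, are hypothetical examples.

```shell
# Throwaway repository.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Patterns for files Git should never stage.
cat > .gitignore <<'EOF'
# Rendered output that can be regenerated from the .Rmd source
*.pdf
*.docx
# Hypothetical file holding credentials
secrets.txt
EOF

# Create one source file plus two files that should stay out of the repo.
touch analysis.Rmd analysis.pdf secrets.txt

# Even --all skips the ignored files.
git add --all
git status --short

# git check-ignore confirms which rule excludes a file.
git check-ignore -v analysis.pdf
```

Only .gitignore and analysis.Rmd end up staged. An ignore rule does not remove a file that was already committed, so it is best to add the patterns before the first commit.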
Oeppen, Jim, and James W. Vaupel. 2002. “Broken Limits to Life Expectancy.” Science 296 (5570): 1029–31. https://doi.org/10.1126/science.1069675.
1. Graphic from Mark Lodato.